BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Koneru, Sai, Retkowski, Fabian, Huber, Christian, Hilgert, Lukas, Akti, Seymanur, Ugan, Enes Yavuz, Waibel, Alexander, Niehues, Jan
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it into the Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
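The abstract describes a cascade that turns each lecture segment into three synchronized outputs. A minimal sketch of such an orchestration is below; the component functions (transcribe, translate_text, localize_slide, synthesize_speech) are illustrative placeholders for the ASR, slide-aware MT, slide-translation, and TTS models, not the released BOOM implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LectureSegment:
    audio: bytes        # spoken audio aligned to this segment
    slide_image: bytes  # the slide shown while the segment is spoken

# Hypothetical component stubs standing in for the actual models.
def extract_slide_text(slide_image: bytes) -> str:
    return "slide text"                      # OCR over the slide

def transcribe(audio: bytes) -> str:
    return "source-language transcript"      # ASR

def translate_text(transcript: str, slide_text: str, tgt_lang: str) -> str:
    return f"[{tgt_lang}] {transcript}"      # slide-aware MT (slide text as context)

def localize_slide(slide_image: bytes, tgt_lang: str) -> bytes:
    return slide_image                       # replace slide text, keep visual elements

def synthesize_speech(text: str, tgt_lang: str) -> bytes:
    return text.encode()                     # TTS in the target language

def translate_lecture(segments: List[LectureSegment], tgt_lang: str) -> List[dict]:
    """Produce the three synchronized output modalities for every segment."""
    outputs = []
    for seg in segments:
        slide_text = extract_slide_text(seg.slide_image)
        transcript = transcribe(seg.audio)
        translation = translate_text(transcript, slide_text, tgt_lang)
        outputs.append({
            "text": translation,                                 # for reading
            "slide": localize_slide(seg.slide_image, tgt_lang),  # for visual understanding
            "speech": synthesize_speech(translation, tgt_lang),  # for auditory learning
        })
    return outputs
```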
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (5 more...)
A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
Betala, Siddharth, Raj, Kushan, Betala, Vipul, Saswade, Rohan
In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets.
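As a rough illustration of the routing described above, the sketch below classifies each training example and dispatches it to one of the two correctors. The judge and corrector functions are placeholder stubs for the multimodal judge, GPT-4o-mini, and IndicTrans2 calls; the prompts and API details are not taken from the paper.

```python
from enum import Enum

class Verdict(Enum):
    CORRECT = "correct"
    VISUALLY_AMBIGUOUS = "visually_ambiguous"   # image context needed to disambiguate
    MISTRANSLATED = "mistranslated"             # pure translation-quality problem

# Hypothetical stand-ins for the judge and the two correctors.
def judge(src: str, tgt: str, image_path: str) -> Verdict:
    """Multimodal LLM judge: classify the (source, translation, image) triple."""
    return Verdict.CORRECT

def regenerate_with_vision(src: str, image_path: str, lang: str) -> str:
    """Vision-guided corrector (GPT-4o-mini in the paper): rewrite using the image."""
    return f"<caption regenerated for {lang}>"

def retranslate(src: str, lang: str) -> str:
    """Text-only corrector (IndicTrans2 in the paper): translate from scratch."""
    return f"<retranslation into {lang}>"

def clean_example(src: str, tgt: str, image_path: str, lang: str) -> str:
    """Route one training example through the judge-corrector pipeline."""
    verdict = judge(src, tgt, image_path)
    if verdict is Verdict.VISUALLY_AMBIGUOUS:
        return regenerate_with_vision(src, image_path, lang)
    if verdict is Verdict.MISTRANSLATED:
        return retranslate(src, lang)
    return tgt  # already correct: keep the original caption
```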
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Li, Xuchen, Li, Xuzhao, Hu, Shiyu, Huang, Kaiqi
Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves a new state of the art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy.
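The localized re-sampling step could look roughly like the sketch below: for each key frame the policy selects, additional frames are sampled densely inside a small temporal window around it. The window size and densified frame rate here are assumptions for illustration, not EARL's actual settings.

```python
def localized_resample(key_times: list[float], duration_s: float,
                       window_s: float = 2.0, dense_fps: float = 8.0) -> list[float]:
    """Return extra timestamps sampled densely around each selected key frame,
    giving the model fine-grained temporal detail near the chosen evidence."""
    step = 1.0 / dense_fps
    extra = set()
    for t in key_times:
        start = max(0.0, t - window_s / 2)
        end = min(duration_s, t + window_s / 2)
        k = 0
        while start + k * step <= end:
            extra.add(round(start + k * step, 3))
            k += 1
    return sorted(extra - set(key_times))  # only the newly added timestamps

# e.g. two key frames in a 600 s video -> 16 extra densely sampled frames around each
print(localized_resample([120.0, 407.5], duration_s=600.0)[:5])
```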
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Multimodal Safety Evaluation in Generative Agent Social Simulations
Vera, Alhim, Sanchez, Karen, Hinojosa, Carlos, Hamid, Haidar Bin, Kim, Donghoon, Ghanem, Bernard
Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
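One of the reported quantities, the unsafe-to-safe conversion rate, can be computed from logged plan revisions roughly as in the sketch below; the record format is an assumption for illustration, not the paper's SocialMetrics implementation.

```python
from dataclasses import dataclass

@dataclass
class PlanRevision:
    agent: str
    unsafe_before: bool   # plan flagged unsafe before the revision
    unsafe_after: bool    # plan still unsafe after the revision

def unsafe_to_safe_rate(revisions: list[PlanRevision]) -> float:
    """Fraction of initially unsafe plans that became safe after revision."""
    unsafe = [r for r in revisions if r.unsafe_before]
    if not unsafe:
        return 0.0
    converted = sum(1 for r in unsafe if not r.unsafe_after)
    return converted / len(unsafe)

# Toy log: 3 of the 4 unsafe plans are fixed -> 0.75 (the order of the
# 75 percent rate reported for the Claude runs in the abstract).
demo = [
    PlanRevision("a1", True, False),
    PlanRevision("a2", True, False),
    PlanRevision("a3", True, True),
    PlanRevision("a4", True, False),
    PlanRevision("a5", False, False),
]
print(unsafe_to_safe_rate(demo))  # 0.75
```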
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Switzerland > Geneva > Geneva (0.04)
- Information Technology (0.46)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations
Penuela, Ricardo Gonzalez, Arias-Russi, Felipe, Capriles, Victor
Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must sift through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV users' questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions.
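The retrieval-then-prompting step described above might look roughly like the following sketch, which ranks past VizWiz-LF contexts by cosine similarity of image embeddings and folds their questions into the description prompt. The embedding format and prompt wording are assumptions, not the released system.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def build_guided_prompt(query_emb: list[float],
                        past_contexts: list[dict], k: int = 3) -> str:
    """past_contexts: [{'embedding': [...], 'question': '...'}] drawn from VizWiz-LF.
    Retrieve the k most similar past images and fold their questions into the prompt
    so the MLLM prioritizes details BLV users have historically asked about."""
    ranked = sorted(past_contexts,
                    key=lambda c: cosine(query_emb, c["embedding"]),
                    reverse=True)
    questions = [c["question"] for c in ranked[:k]]
    bullet_list = "\n".join(f"- {q}" for q in questions)
    return (
        "Describe this image for a blind or low-vision user. "
        "Prioritize details that would answer questions like:\n" + bullet_list
    )
```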
- North America > United States (0.41)
- North America > Puerto Rico > Peñuelas > Peñuelas (0.41)
- South America > Colombia (0.04)
- Europe > Spain (0.04)
Steering Multimodal Large Language Models Decoding for Context-Aware Safety
Liu, Zheyuan, Xu, Zhangchen, Dou, Guangyao, Yuan, Xiangchi, Tan, Zhaoxuan, Poovendran, Radha, Jiang, Meng
Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
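A rough sketch of the two stages in logit space is below: the contrast between the real and Gaussian-noised image highlights visually grounded tokens, and a scene-level safety score then modulates refusal-token logits. The noise level, mixing weight, and modulation rule here are illustrative assumptions rather than SafeCoDe's actual formulation.

```python
import numpy as np

def gaussian_noise_image(image: np.ndarray, sigma: float = 0.3) -> np.ndarray:
    """Destroy the visual content while keeping the input shape intact."""
    return np.clip(image + np.random.normal(0.0, sigma, size=image.shape), 0.0, 1.0)

def contrastive_logits(logits_real: np.ndarray, logits_noised: np.ndarray,
                       alpha: float = 1.0) -> np.ndarray:
    """Stage 1 (sketch): boost tokens whose probability depends on the real image,
    damp tokens the model would emit even without meaningful visual input."""
    return logits_real + alpha * (logits_real - logits_noised)

def modulate_for_safety(logits: np.ndarray, refusal_token_ids: list[int],
                        unsafe_score: float, beta: float = 2.0) -> np.ndarray:
    """Stage 2 (sketch): raise refusal-token logits when the scene-level safety
    verdict flags the request as risky, lower them for benign scenes."""
    out = logits.copy()
    out[refusal_token_ids] += beta * (unsafe_score - 0.5)
    return out
```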
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)